Predicting influence in social networks

ABSTRACT

A method, system and computer program product are disclosed for predicting influence in a social network. In one embodiment, the method comprises identifying a set of users of the social network, and identifying a subset of the users as influential users based on defined criteria. A multitude of measures are identified as predictors of which ones of the set of users are the influential users. These measures are aggregated, and a composite predictor model is formed based on this aggregation. This composite predictor model is used to predict which ones of the set of users will have a specified influence in the social network in the future. In one embodiment, the specified influence is based on messages sent from the users, and for example, may be based on the number of the messages sent from each user that are re-sent by other users.

BACKGROUND OF THE INVENTION

The present invention generally relates to social network analysis, and more specifically, to predicting influence in social networks.

Over the past decade, the Internet has created new channels and enormous opportunities for companies to reach customers, advertise products, and transact business. In this well established business model, companies fully control their own web-based reputation via the content appearing on their websites. The advent of Web 2.0, with its emphasis on information sharing and user collaboration, is fundamentally altering this landscape. Increasingly, the focal point of discussion of all aspects of a company's product portfolio is moving from individual company websites to collaborative sites, blogs, and forums, collectively known as Social Media. In this new media essentially anyone can post comments and opinions about companies and their products, which may influence the perceptions and purchase behavior of a large number of potential buyers. This is of obvious concern to marketing organizations: not only is the spread of negative information difficult to control, but it can be very difficult to even detect it in the large space of blogs, forums, and social networking sites.

The extent to which any reputation can be impacted by a negative story depends heavily on where the story first appears. Negative sentiment posted on an influential blog is clearly more damaging than if it appears on an inconsequential blog. Conversely, marketing people may wish to inject a positive view into the blogosphere, and hence they need to know who are the most influential bloggers relevant to a specific topic. For this reason, rigorous measures of influence and authority are essential to social media marketing.

Micro-blogs like Twitter have raised the stakes even further relative to conventional blogs. Literally within minutes, a story or opinion can spread to millions of individuals. Clearly, the speed with which such a story propagates depends on the degree of influence carried by the nodes that immediately adopt the story.

Identifying the most important or prominent actors in a network has been an area of much interest in Social Network Analysis dating back to Moreno's work in the 1930s [J. Moreno, Who Shall Survive? Foundations of Sociometry, Group Psychotherapy and Sociodrama. Washington D.C.: Nervous and Mental Disease Publishing Co., 1934.]. This interest has spurred the formulation of many graph-based socio-metrics for ranking actors in complex physical, biological and social networks. These socio-metrics are usually based on intuitive notions such as access and control over resources, or brokerage of information [D. Knoke and R. Burt, Applied Network Analysis. Newbury Park, Calif.: Sage, 1983, ch. Prominence.], and have yielded measures such as Degree Centrality, Closeness Centrality and Betweenness Centrality [S. Wasserman and K. Faust, Social Network Analysis: Methods & Applications. Cambridge, UK: Cambridge University Press, 1994.].

In the exploratory analysis of networks, the question of whether these measures of centrality really capture what we mean by “importance” is often not directly addressed. However, when such socio-metrics start being used to drive decisions in more quantitative fields, there emerges a need to empirically answer this question. Probably the most popular of these measures in the Data Mining community is PageRank, which is a variant of Eigenvector Centrality [L. Katz, “A new status index derived from sociometric analysis,” Psychometrika, vol. 18, pp. 39-43, 1953.]. Once its use in Information Retrieval (IR) and Web search in particular became popular, it led to more rigorous evaluation of PageRank and variants on measurable IR tasks [M. Richardson, A. Prakash, and E. Brill, “Beyond pagerank: machine learning for static ranking,” in WWW, 2006.], [T. H. Haveliwala, “Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search,” IEEE Trans. on Knowl. and Data Eng., vol. 15, no. 4, pp. 784-796, 2003].

With the rise of Web 2.0, with its focus on user-generated content and social networks, various socio-metrics are being increasingly used to produce ranked lists of “top” bloggers, twitterers, etc. Do these rankings really identify “influential” authors, and if so, which ranking is better? With the increased demand for Social Media Analytics, with its focus on deriving marketing insight from the analysis of blogs and other social media, there is a growing need to address this question.

BRIEF SUMMARY

Embodiments of the invention provide a method, system and computer program product for predicting influence in a social network. In one embodiment, the invention provides a method comprising identifying a set of users of the social network, and identifying a subset of the users as influential users based on defined criteria. A multitude of measures are identified as predictors of which ones of the set of users are the influential users. These measures are aggregated, and a composite predictor model is formed based on this aggregation. This composite predictor model is used to predict which ones of the set of users will have a specified influence in the social network in the future.

In one embodiment, messages are sent by and among the set of users of the network, and the specified influence is based on the messages sent from the users; and, for example, in an embodiment, the specified influence may be based on the number of messages sent from each user that are re-sent by other users.

In an embodiment, a training set of data is used to determine, for each of the measures, an effectiveness of the measure at predicting which ones of the set of users sent the messages that were most re-sent by other users. In this embodiment, the measures are aggregated based on the determined effectiveness of each of the measures. In one embodiment, all of the multitude of measures are used to form the composite predictor model. In an embodiment, only some of the multitude of measures are used to form the composite predictor model. In one embodiment, the composite predictor model is formed by combining all of said measures through logistic regression.

In one embodiment, the set of users is separated into different classes according to degrees of influence of the users based on said defined criteria, and the composite predictor model is used to predict how often users in these different classes will have the specified effect in the social network in the future.

In an embodiment, the set of users is separated into different classes according to a degree of influence of each of the users based on said defined criteria, and the aggregating of the measures is based on an accuracy of each of the measures as a predictor of which of the users are in which of the classes.

In an embodiment, different weights are assigned to the multitude of measures to form a supervised rank aggregation of these measures, and this supervised rank aggregation is used to form the composite predictor model.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 shows graphs based on three different relationships between users of a social network.

FIG. 2 illustrates a procedure for evaluating measures of influence in a social network.

FIG. 3 is a graph showing the distribution of activity of a set of Twitter users.

FIG. 4 is a graph illustrating the results of combining different socio-metrics to predict influential people in a social network.

FIG. 5 is a graph showing PageRanks on different graphs.

FIG. 6 shows two algorithms that may be used in embodiments of the invention.

FIG. 7 is a graph comparing supervised vs. unsupervised rank aggregation methods.

FIG. 8 illustrates the performance of rank aggregation techniques with increasing training data.

FIG. 9 shows a computing environment that may be used to implement an embodiment of the invention.

DETAILED DESCRIPTION

As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, method or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, embodiments of the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.

Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium, upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The present invention relates to social network analysis, and more specifically, to predicting influence in social networks. The question of whether a particular influence measure is good is ill-posed, unless it is put in the context of a measurable task or desired outcome. Constructing such predictive tasks of interest not only guides the choice of relationships, but also allows for the quantitative comparison of different socio-metrics.

Taking a predictive perspective of measures of influence can also suggest alternative socio-metrics, and embodiments of the invention combine aspects of different measures to produce a composite ranking mechanism that is most beneficial for each desired predictive task. Embodiments of the invention compare several approaches to combining influence measures through rank aggregation methods, such as approximation of Kemeny optimal aggregation [C. Dwork, R. Kumar, M. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” in WWW, 2001]. In addition, embodiments of the invention use novel supervised rank aggregation techniques that leverage the ground truth on a subset of users to further improve ranking.

One embodiment of the invention that has been actually reduced to practice is based on a case study of forty million users of the social networking website Twitter.

Twitter is a free social networking and microblogging service that enables its users to post and read messages known as tweets. Tweets are short texts of up to 140 characters. They are displayed on the profile page of the author, as a response to any search that matches the keywords, and delivered to the author's subscribers who are known as followers. All users can send and receive tweets via the Twitter website, Short Message Service (SMS) or external applications. Many of the more formal interactions discussed below (retweet, mention and reply) have grown out of the usage patterns and are only now starting to be formalized and embedded in the Twitter interface.

Follower/Friend: Any Twitter user i can choose to follow another user k. This primarily means that user i will be able to see the tweets posted by user k. User i is called a follower of user k and sometimes the inverse relationship is denoted as k being a friend of i.

Retweet: When a user i wants to share a tweet by k with his followers, he can retweet that tweet. This adds the username of k in the form “RT @k” to the beginning of the tweet, followed by a copy of the original content.

Reply: Some users communicate with each other via Twitter. If user i wants to talk to or reply to user k, the tweet starts with “@k” followed by the message.

Mention: If a user wants to talk about somebody rather than to somebody, the username appears somewhere later in the tweet, and this is called a mention.

Hashtags: These tokens originated as a community-driven convention for adding metadata to tweets. They appear in the tweet preceded by a “#”, and are primarily used to create groupings.

Data Collection and Graph Construction

The first set of data that we collected through Twitter's Search API was obtained by searching on terms referring to “Company A”.

Base User List: A keyword search on “Company A” generated a list of tweets available for the last 7-10 days. This data was gathered regularly for a month. We then built a list of 9625 users that had made the most recent tweets in this set. This was our base set of users for the balance of the study.

We are interested in social networks and in how people influence each other, leading to viral phenomena. The first question is how to define the social Twitter graph. One answer would be the follower relationship. The advantage of the follower graph, illustrated in FIG. 1, is that it is easily obtainable through the Twitter API. However, many users have 100K or more friends, and therefore following may not be a very strong relationship that really influences behavior. For this reason, we examine two other graphs, described below and also illustrated in FIG. 1.

Follower Graph: The follower graph was generated by obtaining information about each user through the Twitter REST API. We started with the base list of 9625 users, and added all their followers. This first iteration added approximately 2.5 million new unique users. A second iteration of the same process resulted in approximately 35 million users. Note that this does not imply that the users are active, only that they were in the system during our data pull between Nov. 23 and Nov. 29, 2009.

TABLE I
GRAPH STATISTICS

Statistic          Follower Graph   Retweet Graph   Mention Graph
Number of users    39,855,505       412,774         637,509
Number of edges    1,098,443,217    975,326         1,473,361
Max. in-degree     3,985,581        9,860           14,313
Max. out-degree    153,926          1,318           4,427

In addition to the somewhat static Follower Graph, we consider two alternative, implicitly embedded graphs that reflect the users' current behavior: the Retweet Graph and the Mention Graph. We build graphs of who retweeted whom and who is talking to or about whom. Unfortunately this information is not directly available and has to be extracted from the tweets. As noted above, both retweet and mention/reply create tags that start with “@”, followed by the name of the user that is replied to, mentioned, or retweeted. To identify tweets that contain this information, we started with our base list of 10K users. We then searched for all tweets that contained any of the usernames in our base list preceded by “@”. We obtained a unique list of users that made these tweets; they either mentioned a user in the base list or retweeted a tweet by a user in the base list. Using this list we pulled a second iteration of tweets. We repeated this one more time to generate a third iteration. The three iterations cover tweets during the time period from Nov. 11 to Nov. 19, 2009. From these three iterations, we can extract a) links between users originating from retweets and b) links between users that reflect mentions or replies.

Retweet Graph and Weighted Retweet Graph: We parse all three iterations of tweets from the previous searches and extract only those tweets by a user i that start with the indicators of retweets, such as “RT @k”. A link from user i to user k in this graph means that i is retweeting k. We generate two versions of the retweet graph. The first version collapses all repeat retweets from the same user i to the user k into just one edge. The second version uses the number of retweet instances as edge weights.
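The extraction step described above can be sketched in a few lines. This is only an illustrative sketch: the regular expression, the function name retweet_edges, and the sample tweets are assumptions made for the example, and the text does not specify the exact parsing rules that were used.

```python
import re
from collections import Counter

# Assumed retweet indicator: the tweet text begins with "RT @username".
RETWEET_RE = re.compile(r"^RT @(\w+)")

def retweet_edges(tweets):
    """Return weighted edges {(i, k): count}, meaning user i retweeted user k."""
    weights = Counter()
    for author, text in tweets:
        match = RETWEET_RE.match(text)
        if match:
            weights[(author, match.group(1))] += 1
    return weights

# Hypothetical sample data: (author, tweet text) pairs.
tweets = [
    ("alice", "RT @bob: new phone launch"),
    ("alice", "RT @bob: another update"),
    ("carol", "RT @bob: new phone launch"),
    ("bob", "just a regular tweet"),
]

weighted = retweet_edges(tweets)   # weighted version: edge weight = retweet count
unweighted = set(weighted)         # collapsed version: one edge per (i, k) pair
```

Keeping the counts in a single mapping and deriving the collapsed graph from its keys yields both versions of the graph from one pass over the tweets.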

Mention Graph and Weighted Mention Graph: The only difference in the construction of the Mention Graph over the Retweet Graph is that we ignore any retweets starting with “RT” and only look for occurrences of names in the text of the tweet. We generate two versions of the Mention Graph just as we did for the Retweet Graph. The dimensions of these graphs are presented in Table I.

Test Data: Ultimately we want to predict outbursts of retweets. So once these three graphs were generated, we continued obtaining additional data over the next week from Nov. 26 to Dec. 3, 2009. We gathered the first iteration of the retweet graph to keep track of how often the users in the base list were retweeted. We also collected all of the tweets by the original 10K users.

Measure Definition

TABLE II
MEASURE DEFINITIONS

Measure                      Definition
Followers                    Follower Graph Indegree
Friends                      Follower Graph Outdegree
Follower Pagerank            Follower Graph Pagerank
Distinct Past Retweets       Retweet Graph Indegree
People Retweeted             Retweet Graph Outdegree
Retweet Pagerank             Retweet Graph Pagerank
Past Retweets                Weighted Retweet Graph Indegree
Retweets Made                Weighted Retweet Graph Outdegree
Distinct Mentions Received   Mention Graph Indegree
People Mentioned             Mention Graph Outdegree
Mention Pagerank             Mention Graph Pagerank
Mentions Received            Weighted Mention Graph Indegree
Mentions Made                Weighted Mention Graph Outdegree

For any directed graph we can compute the following measures on the nodes: in-degree (number of arcs pointing to the node), out-degree (number of arcs pointing from the node), and PageRank (the measure described in [S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Computer Networks, vol. 30, no. 1-7, pp. 107-117, 1998.]). We generate these measures for each of the three graphs and the weighted variants described earlier. Table II lists the graph measures, along with more intuitive names, and some statistics on each for our 10K base user list. We also include the target variable, i.e., the number of retweets during the test phase, and the number of tweets during the same phase for these users. Instead of using the raw graph-based measures described above, we transformed them into percentiles. Empirically we found that the percentile rank transformation (essentially converting each measure into a uniform distribution) worked better for modeling than using the raw values or a log-transformation.
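As a minimal sketch of the degree measures and the percentile transformation described above, the following computes in-degree and out-degree from an edge list and maps raw values to percentile ranks. The helper names, the toy data, and the tie convention (the fraction of users whose value is less than or equal to the user's own) are assumptions for illustration; the text does not state which percentile convention was used.

```python
from bisect import bisect_right
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1   # arc pointing from src
        indeg[dst] += 1    # arc pointing to dst
    return indeg, outdeg

def percentile_ranks(scores):
    """Map each user to the fraction of users whose value is <= their own."""
    ordered = sorted(scores.values())
    n = len(ordered)
    return {user: bisect_right(ordered, value) / n
            for user, value in scores.items()}

# Hypothetical edges (i, k): user i follows user k.
edges = [("alice", "bob"), ("carol", "bob"), ("bob", "alice")]
indeg, outdeg = degrees(edges)

# Hypothetical raw follower counts with a heavy tail.
followers = {"a": 10, "b": 1000, "c": 10, "d": 1}
pct = percentile_ranks(followers)
```

The transformation keeps only the ordering of users, so a user with 1000 followers and one with a million followers would both sit near 1.0 if they are at the top of the list.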

Activity Statistics

Additionally, we analyzed Twitter user activity data for 2.4 million users in our data set. We found that 713,464 of these users had no tweets in over 30 days, and are possibly inactive or read-only users. The average number of tweets per day was 3.24 (4.40 if we exclude the inactive users). FIG. 3 shows the distribution of user activity in this universe.

Identifying Viral Potential

One of the biggest opportunities and threats presented by social media is the viral outbreak of messages, videos, tweets, memes [J. Leskovec, L. Backstrom, and J. Kleinberg, “Meme-tracking and the dynamics of the news cycle,” in KDD, 2009.], etc. For marketing and PR organizations this can be a boon or curse based on the sentiment expressed in these messages towards specific brands, products or entities. As such, marketers are constantly looking for ways to influence positive outbreaks or thwart negative ones. Either way, they often base their actions on the perceived importance of authors in the social media space.

In the micro-blogging universe of Twitter, this suggests that a useful task would be to predict which twitterers will be significantly rebroadcast via retweets. We construct such a task from our data by dividing users in our test phase into two classes: people who have been retweeted 100 or more times within a week, and those who have not. Roughly 1.6% of our population (151 people) fall in the first target class. With reference to FIG. 2, we treat this as a binary classification problem, where we use logistic regression to train a model on the different ranking measures on historical data, and then use it to predict the potential for viral retweeting in the test time period.

With reference to FIG. 4, since we are primarily concerned with how well these measures perform at ranking users, we compare the area under the ROC curve (AUC) based on using each measure by itself, as well as a composite model that uses all measures as independent variables in a logistic regression model. In FIG. 4, curves 41 and 42 show the true positive rate vs. the false positive rate using, respectively, people mentioned and followers as the metric. Curve 44 shows the true positive rate vs. the false positive rate using past retweets as the metric, and curve 45 shows the true positive rate vs. the false positive rate using a composite ranking as the metric.
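The evaluation described above can be illustrated with a small, self-contained sketch: a logistic regression fitted by plain gradient descent on percentile features, scored by AUC computed as the fraction of (positive, negative) pairs ranked correctly. The function names, the learning rate and step count, and the four-user toy data are all assumptions for illustration; the text does not describe the fitting procedure or software that was used, and the toy numbers are not results from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return w, b

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical percentile features per user: [followers, past retweets].
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]   # 1 = retweeted 100 or more times in the test week

w, b = fit_logistic(X, y)
composite = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

single_measure_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
composite_auc = auc(composite, y)
```

Because AUC depends only on the ordering of the scores, it can be applied equally to a single raw measure, a percentile, or the output of the composite model, which is what makes the comparison across measures meaningful.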

The results of our 10-fold cross-validation experiments are summarized in Table III. We find that 9 of the 13 measures by themselves are quite effective at ranking the top potentially viral twitterers, with an AUC>0.8. Not surprisingly, the total number of times that someone has been retweeted in the recent past, as well as the number of distinct people who have retweeted this person, are the most predictive measures. However, just using the number of followers produces an equally good ranking. Note that the Spearman rank correlation between Distinct Past Retweets and Followers is not high (0.43), suggesting that there are multiple forces at work here.
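For readers unfamiliar with the Spearman rank correlation mentioned above, the following sketch computes it as the Pearson correlation of the rank vectors, using average ranks for ties. The function names and the sample values are assumptions for illustration only.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

same_order = spearman([10, 20, 30, 40], [1, 2, 3, 4])
reverse_order = spearman([1, 2, 3, 4], [4, 3, 2, 1])
tied = average_ranks([5, 5, 1])
```

A value near 1 means two measures order the users almost identically, so a moderate value such as the 0.43 reported above indicates that the two measures rank many users quite differently even though each ranks the viral users well.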

Pageranks on the Retweet Graph and Follower Graph also perform well (FIG. 5), but not as well as the in-degree measures on the corresponding graphs. This may suggest that for people who are retweeted a lot, it is sufficient to deduce this from their immediate neighbors in the Follower Graph and Retweet Graph. It would appear that the intuition behind PageRank, that links from people with higher ranks are more important, does not apply in this case. In FIG. 5, the Retweet graph is shown at 51, the Follower graph is shown at 52, and the Mention graph is shown at 53.

Finally, we observe that combining all measures through logistic regression provides a substantially better ranking than any one of these measures in isolation. This improvement can be seen in FIG. 4, where we compare the composite ranking to the best individual measure from each graph.

TABLE III
COMPARING DIFFERENT RANKING MEASURES FOR IDENTIFYING POTENTIAL FOR VIRAL REBROADCASTING.

Measure                      AUC
Past Retweets                0.884
Distinct Past Retweets       0.884
Followers                    0.884
Retweet Pagerank             0.870
Follower Pagerank            0.862
People Retweeted             0.844
Retweets Made                0.842
People Mentioned             0.841
Mentions Made                0.828
Friends                      0.760
Mention Pagerank             0.685
Distinct Mentions Received   0.560
Mentions Received            0.560
Composite                    0.953

In addition, we performed ablation studies, where we built three additional composite models as before but ignored either the Retweet Graph, Follower Graph or Mention Graph in each. The ranking results, presented in Table IV, show that removing measures from any of the three graphs diminishes our ability to identify viral potential. This underscores the fact that each aspect (network of followers, diffusion of past retweets, and interactions through replies and mentions) contributes to one's potential to reach a large audience. By focusing on selecting a single centrality measure to capture influence we would miss out on the opportunity to more precisely detect potentially viral tweets.

TABLE IV
COMPARING PERFORMANCE WITHOUT USING MEASURES FROM EACH GRAPH. THE BASELINE AUC IS 0.953.

Ablation            AUC
No Retweet Graph    0.926
No Mention Graph    0.948
No Follower Graph   0.908

Predicting Level of Attention

Though identifying the most influential players is usually paramount, marketing organizations are also keen to understand their entire customer base. This suggests an alternative to focusing on the most retweeted individuals, which is to predict the different levels of audience attention a twitterer receives. For this task, we divide users into four classes based on the amount a person is retweeted, and try to predict this level of retweeting from historical data. The class definitions and sizes are presented in Table V. As before, we run experiments comparing the performance of each individual ranking measure versus logistic regression applied to all measures. Since we care about effectively segmenting the population, we compare overall classification accuracy on the 4-class problem, presented in Table VI.

TABLE V
DEFINITION OF CLASSES BASED ON LEVEL OF ATTENTION.

Class    Range of retweets   Class Size
None     0                   4743
Low      1-19                3936
Medium   20-99               795
High     >99                 151
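The class boundaries in Table V translate directly into a labeling function. The sketch below is illustrative only; the function name is an assumption, and the majority-class figure is simple arithmetic on the class sizes in Table V rather than a result reported in the text.

```python
def attention_class(retweets):
    """Assign the Table V class for a user's retweet count in the test week."""
    if retweets == 0:
        return "None"
    if retweets <= 19:
        return "Low"
    if retweets <= 99:
        return "Medium"
    return "High"

# Class sizes as listed in Table V.
class_sizes = {"None": 4743, "Low": 3936, "Medium": 795, "High": 151}
total_users = sum(class_sizes.values())

# Accuracy of a trivial classifier that always predicts the largest class.
majority_accuracy = max(class_sizes.values()) / total_users
```

The class sizes sum to the 9625 users of the base list, and always predicting the largest class would already be right for roughly 49% of users, which gives some context for reading overall accuracy figures on such imbalanced classes.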

As in the task of predicting viral potential, we observe that while several measures help discriminate between classes, the composite of these measures performs best. This reinforces our position that different aspects of each network can all contribute bits of information that help predict more effectively the level of attention an author will receive. Note that the best overall accuracy of 69% is not high, which is an artifact of the high imbalance in class sizes. The ranking on each class, however, is quite good, ranging from an AUC of 0.77 to 0.95. Nevertheless, we use accuracy as it serves as a good summary statistic, allowing us to compare relative performance across classes.

It is important to note that the top predictive measures for this task are different from the previous task. Notably, Retweet Pagerank is more effective than Past Retweets, which was the best ranking measure for the viral task. One possible explanation is that, while Past Retweets is sufficient in identifying the top retweeted, the level of second degree retweeting is more relevant when discriminating the users with low and medium levels of audience attention.

TABLE VI
COMPARING DIFFERENT RANKING MEASURES FOR PREDICTING THE ATTENTION A TWITTERER WILL RECEIVE.

Measure                      Accuracy (%)
Retweet Pagerank             64.89
Followers                    64.52
People Mentioned             63.63
People Retweeted             63.40
Retweets Made                63.38
Follower Pagerank            63.31
Mentions Made                62.60
Friends                      61.79
Past Retweets                61.16
Distinct Past Retweets       60.87
Mention Pagerank             60.51
Mentions Received            53.74
Distinct Mentions Received   53.49
Composite                    69.07

There are a lot of factors that could influence different levels of retweeting that are beyond what is captured by the network measures we examine, e.g., hot topics in the news. We would like to clarify that our objective is not to develop the best model for prediction on this task. Instead, we want to illustrate that constructing tasks such as this allows us to quantitatively compare socio-metrics and aids in the selection of one or a composite of such centralities.

Rank Aggregation

In the discussion above, we have shown that combining influence measures performs better than any single measure. So far, we have used logistic regression in order to combine the scores produced by each influence measure to produce a composite score. However, given that the individual influence measures produce an ordering of elements and not just a point-wise score, we can instead leverage approaches for aggregating rankings. Methods for rank aggregation or preference aggregation have been used extensively in Social Choice Theory, where there is no ground truth ranking, and as such are unsupervised. Here, we introduce several supervised approaches to rank aggregation that can be trained based on the ground-truth ordering of a subset of elements. In the discussion below, we compare several unsupervised and supervised approaches of aggregating rankings for the task of predicting viral potential.

The Rank Aggregation Task

We begin by formally defining the general task of rank aggregation. Given a set of entities S, let V be a subset of S and assume that there is a total ordering among entities in V. We are given r individual rankers τ₁, ..., τ_(r) who specify their order preferences over the m candidates, where m is the size of V; i.e., τ_(i) = [d₁, ..., d_(m)], i = 1, ..., r, if d₁ > ... > d_(m), with d_(j) ∈ V, j = 1, ..., m. A rank aggregation function takes the input orderings from the r rankers and produces τ, which is an aggregated ranking order. If V equals S, then τ is called a full list (total ordering); otherwise it is called a partial list (partial ordering).

All commonly-used rank aggregation methods satisfy one or more of the following desirable goodness properties [H. Young and A. Levenglick, “A consistent extension of Condorcet's election principle,” in SIAM Journal of Applied Math, vol. 35(2), 1978, pp. 285-300.].

-   Unanimity: If every ranker prefers d_(i) to d_(j), i≠j, then the aggregate ordering prefers d_(i) to d_(j).
-   Non-dictatorial: At least two voters should have the potential to affect the outcome.
-   Independence from irrelevant alternatives: The relative order of d_(i) and d_(j) in the aggregate depends only on the relative order of d_(i) and d_(j) in the input orderings.
-   Neutrality: If two candidates switch positions in every input ordering, then they must switch positions in the aggregate.
-   Consistency: If rankers are split into P and Q, and both the aggregate of P and the aggregate of Q prefer d_(i) to d_(j), then the aggregate of P∪Q should also prefer d_(i) to d_(j).
-   Condorcet Criterion: A Condorcet winner, a candidate preferred by most rankers to any other candidate, should be ranked first in the aggregate.
-   Extended Condorcet Criterion (ECC): Let us split the entities into two partitions, Q and R. If for all d_(i)∈Q and d_(j)∈R, a majority of rankers prefer d_(i) to d_(j), then the aggregate should prefer d_(i) to d_(j).

In the discussion below, we compare rank aggregation methods that satisfy some of the above properties, as well as supervised versions of these methods.

Borda Aggregation

In Borda rank aggregation [J. Borda, “Memoire sur les elections au scrutin,” in Histoire de l'Academie Royale des Sciences, 1781.], each candidate is assigned a score by each ranker, where the score for a candidate is the number of candidates below him in that ranker's preferences. The Borda aggregation is the descending order arrangement of the Borda score for each candidate averaged across all ranker preferences. Borda satisfies all goodness characteristics except Condorcet and Extended Condorcet Criteria. In fact, it has been shown that no method that assigns weights to each position and then sorts the results by applying a function to the weights associated with each candidate satisfies the Condorcet criterion. This includes the logistic regression approach we used in previous sections. This motivates us to consider order-based methods for rank aggregation that satisfy both Condorcet criteria.
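The Borda procedure described above can be sketched as follows. This is a minimal illustration, assuming each ranker supplies a full list with the most preferred candidate first; the function name `borda_aggregate`, the tie-breaking rule and the toy candidates are hypothetical and do not come from the source. The example is chosen so that the Borda ordering does not place the Condorcet winner first.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Order candidates by average Borda score (best first).

    Each ranking is a full list, most preferred candidate first.
    """
    totals = defaultdict(float)
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            # Borda score: number of candidates ranked below this one.
            totals[candidate] += m - 1 - position
    r = len(rankings)
    # Sort by descending average score. Breaking ties by name is an
    # assumption made here only to keep the output deterministic.
    return sorted(totals, key=lambda c: (-totals[c] / r, c))


# Three rankers prefer a > b > c, two prefer b > c > a.
rankings = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2
print(borda_aggregate(rankings))  # ['b', 'a', 'c']
```

In this toy input, candidate a is preferred to every other candidate by three of the five rankers, yet its total Borda score (6) is lower than that of b (7), so the positional method ranks b first.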

Kemeny Aggregation

Kemeny is an order-based aggregation method [J. Kemeny, “Mathematics without numbers,” in Daedalus, vol. 88, 1959, pp. 571-591.], in which the final rank aggregation minimizes the number of pairwise disagreements between all the rankers, i.e., the average Kendall-Tau distance from τ to τ_(i), i=1, . . . , r is minimum. Kemeny Optimal Aggregation is the only function that is neutral, consistent and satisfies the Condorcet criteria. However, it has been shown that computing Kemeny aggregation for r≥4 is NP-Hard. So, instead we use Local Kemenization [C. Dwork, R. Kumar, R. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” in WWW, 2001.], which is a relaxation of Kemeny Optimal aggregation that still satisfies the Extended Condorcet Criterion.

Local Kemenization is computationally tractable in practice, as opposed to Kemeny Optimal Aggregation. A full list τ is locally Kemeny optimal if there is no full list τ⁺ that can be obtained by a single transposition of an adjacent pair of elements, such that

\[ K(\tau^{+}, \tau_1, \ldots, \tau_r) < K(\tau, \tau_1, \ldots, \tau_r), \]

where

\[ K(\tau, \tau_1, \ldots, \tau_r) = \frac{1}{r} \sum_{i=1}^{r} k(\tau, \tau_i). \]

The function k(σ,τ) is the Kendall tau distance, which is the number of pairwise disagreements between two lists σ and τ. In other words, it is impossible to reduce the total distance across all rankers from the local Kemeny aggregation by flipping an adjacent pair in τ. Every Kemeny optimal aggregation is also locally Kemeny optimal, whereas the converse is false. Dwork et al. [C. Dwork, R. Kumar, R. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” in WWW, 2001.] show that Local Kemenization satisfies the Extended Condorcet Criterion and can be computed in O(rm log m), where m is the size of V. The local Kemeny procedure can be viewed as a stable sorting algorithm, where given an initial ordering, elements d_(i) and d_(j) are only swapped if d_(i) is preferred to d_(j) by the majority of rankers (τ's). It is important to note that the initial aggregation passed to Local Kemenization may not necessarily satisfy Condorcet criteria. However, the process of Local Kemenization produces a final ranking that is maximally consistent with the initial aggregation, and in which Condorcet winners are at the top of the list.
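The Kendall tau distance and the local Kemeny procedure described above can be sketched as follows. This is a simple insertion-style illustration, assuming full lists with the most preferred candidate first; the names `kendall_tau`, `average_distance` and `local_kemenize` are hypothetical. It compares candidates directly rather than using a precomputed table, so it does not attain the O(rm log m) bound cited above.

```python
from itertools import combinations


def kendall_tau(sigma, tau):
    """Count candidate pairs that the two full lists order differently."""
    pos = {c: i for i, c in enumerate(tau)}
    return sum(1 for x, y in combinations(sigma, 2) if pos[x] > pos[y])


def average_distance(tau, rankings):
    """K(tau, tau_1, ..., tau_r): mean Kendall tau distance to the rankers."""
    return sum(kendall_tau(tau, t) for t in rankings) / len(rankings)


def local_kemenize(initial, rankings):
    """Insert candidates in the initial order, moving each one up while a
    strict majority of rankers prefers it to the candidate just above."""
    positions = [{c: i for i, c in enumerate(t)} for t in rankings]

    def majority_prefers(x, y):
        wins = sum(1 for p in positions if p[x] < p[y])
        return 2 * wins > len(positions)

    result = []
    for candidate in initial:
        result.append(candidate)
        i = len(result) - 1
        while i > 0 and majority_prefers(result[i], result[i - 1]):
            result[i], result[i - 1] = result[i - 1], result[i]
            i -= 1
    return result


# Three rankers prefer a > b > c, two prefer b > c > a.
rankings = [["a", "b", "c"]] * 3 + [["b", "c", "a"]] * 2
start = ["b", "a", "c"]  # the Borda ordering of these toy rankings
final = local_kemenize(start, rankings)
print(final)                              # ['a', 'b', 'c']
print(average_distance(start, rankings))  # 1.0
print(average_distance(final, rankings))  # 0.8
```

Starting from an initial ordering that violates the Condorcet criterion, the procedure moves the Condorcet winner a to the top and lowers the average distance K from 1.0 to 0.8.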

Supervised Rank Aggregation

Borda and Kemeny aggregations, being motivated from social choice theory, strive for fairness and hence treat all rankers as equally important. However, fairness is not a desirable property in our setting, since we know that some individual rankers (measures) perform better than others in our target tasks, as evidenced by results in Tables III and VI. If we knew a priori which rankers are better, we could leverage this information to produce a better aggregate ranking. In fact, given the ordering of a (small) set of candidates, we can estimate the performance of individual rankers and use this to produce a better ranking on a new set of candidates. We use such an approach to produce different supervised rank aggregation methods, which we describe in more detail below.

In order to accommodate supervision, we extend Borda and local Kemeny aggregation to incorporate weights associated with each input ranking. The weights correspond to the relative utility of each ranker, which may depend on the task at hand. In this section, we focus on the task of viral prediction as described in the discussion below. As such, we weight each ranker based on its (normalized) AUC computed on a validation (training) set of candidates, for which we know the true retweet rates. Incorporating weights in Borda aggregation is relatively straightforward: instead of simple averages, we take weighted averages of Borda scores. This weighted version of Borda is presented in Algorithm 1 in FIG. 6. When called with uniform weights, we will refer to the algorithm simply as Borda. When used with weights based on training set performance, we will refer to it as Supervised Borda.
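The weighting scheme described above can be sketched as follows. This is not Algorithm 1 of FIG. 6, which is not reproduced in the text; it is an illustration under stated assumptions: the AUC is computed by direct pairwise comparison, "normalized" is taken to mean that the AUC values are divided by their sum, and the names `auc`, `ranker_weights` and `weighted_borda` as well as the toy labels are hypothetical.

```python
def auc(scores, labels):
    """Probability that a positive outscores a negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ranker_weights(rankings, labeled):
    """AUC of each ranker on the labeled candidates, scaled to sum to 1.

    The scaling is an assumed reading of "normalized AUC".
    """
    names = list(labeled)
    aucs = []
    for ranking in rankings:
        # An earlier position in the ranking means a higher score.
        score = {c: -i for i, c in enumerate(ranking)}
        aucs.append(auc([score[c] for c in names], [labeled[c] for c in names]))
    total = sum(aucs)
    return [a / total for a in aucs]


def weighted_borda(rankings, weights):
    """Order candidates by the weighted average of their Borda scores."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        m = len(ranking)
        for position, c in enumerate(ranking):
            scores[c] = scores.get(c, 0.0) + w * (m - 1 - position)
    total = sum(weights)
    return sorted(scores, key=lambda c: (-scores[c] / total, c))


rankings = [["a", "b", "c", "d"], ["d", "c", "b", "a"], ["c", "d", "a", "b"]]
labeled = {"a": True, "b": False, "d": False}  # toy validation labels
weights = ranker_weights(rankings, labeled)    # AUCs 1.0, 0.0, 0.5 before scaling
print(weighted_borda(rankings, [1, 1, 1]))     # ['c', 'd', 'a', 'b']
print(weighted_borda(rankings, weights))       # ['a', 'c', 'b', 'd']
```

With uniform weights the function behaves as plain Borda; with the AUC-based weights the ranker that orders the labeled candidates correctly dominates, and the aggregate changes accordingly.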

For supervised Local Kemenization, we incorporate weights directly in sorting the initial ordering. So, instead of comparing candidates based on the preference of the simple majority of individual rankers, we use a weighted majority. This can be achieved by using weighted votes during the creation of the majority table M, which represents the sum of weights of the rankers who prefer the row candidate to the column candidate for each pairwise comparison. This weighted Kemeny procedure is presented in Algorithm 2 in FIG. 6. As with Supervised Borda, we select weights based on each ranker's AUC computed on a training set. When Algorithm 2 is invoked with uniform weights, we will refer to it as Local Kemenization or LK. When used with the performance-based weights, we will refer to it as Supervised LK. Instead of using total orderings provided by each ranker, we can also use partial orderings (for a subset of candidates). Of particular interest is using the partial ordering only on the top k candidates for each ranker. We refer to this variant as Local Kemeny TopK (LK TopK). In Algorithm 2, setting k to |S| (the default) corresponds to Local Kemenization on total orderings. Local Kemenization is sensitive to the initial input ordering μ provided to the algorithm. We experimented with initializing with Borda and Supervised Borda.
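The weighted majority table and the top-k variant described above can be sketched as follows. This is not Algorithm 2 of FIG. 6, which is not reproduced in the text. In particular, the way a ranker votes on candidates outside its top k is an assumption made here: a ranker votes for x over y only when x is within its top k, and abstains when both candidates fall outside it. The names `majority_table` and `weighted_local_kemenize` are hypothetical.

```python
def majority_table(rankings, weights, k=None):
    """M[x][y]: total weight of rankers that prefer x to y.

    With k set, a ranker only votes for x over y when x is in its top k
    (an assumed way to handle partial orderings).
    """
    candidates = list(rankings[0])
    M = {x: {y: 0.0 for y in candidates if y != x} for x in candidates}
    for ranking, w in zip(rankings, weights):
        limit = len(ranking) if k is None else k
        pos = {c: i for i, c in enumerate(ranking)}
        for x in candidates:
            for y in candidates:
                if x != y and pos[x] < pos[y] and pos[x] < limit:
                    M[x][y] += w
    return M


def weighted_local_kemenize(initial, rankings, weights, k=None):
    """Local Kemenization where swaps follow the weighted majority in M."""
    M = majority_table(rankings, weights, k)
    result = []
    for candidate in initial:
        result.append(candidate)
        i = len(result) - 1
        # Move the new candidate up while the weighted vote favors it.
        while i > 0 and M[result[i]][result[i - 1]] > M[result[i - 1]][result[i]]:
            result[i], result[i - 1] = result[i - 1], result[i]
            i -= 1
    return result


rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
start = ["a", "b", "c"]
print(weighted_local_kemenize(start, rankings, [1, 1, 1]))        # ['b', 'a', 'c']
print(weighted_local_kemenize(start, rankings, [0.6, 0.2, 0.2]))  # ['a', 'b', 'c']
print(weighted_local_kemenize(["c", "a", "b"], rankings, [1, 1, 1], k=1))  # ['b', 'a', 'c']
```

With uniform weights two of the three rankers prefer b to a, so b moves to the top; when the first ranker carries a weight of 0.6, its preference for a prevails and the initial ordering is left unchanged.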

Our weighted Local Kemenization algorithm can be run with three varying options, namely (1) with and without supervision, (2) with total orderings or partial (top K) orderings, and (3) with different initial orderings. We experimented with several combinations of these three options. By default, Local Kemenization (LK) refers to unsupervised Local Kemenization with total orderings and initializing with Borda. All other variants are listed in Table VII, where the names list the departures from these defaults, and the initial ordering is mentioned in parentheses, e.g., Supervised LK TopK (Supervised Borda) corresponds to using Supervised Borda for initialization, partial orderings for top K and the supervised version of Local Kemenization. In all our experiments with partial orderings we use the top-ranked 15% of candidates for each ranker.

Experimental Results

We compared the different supervised and unsupervised rank aggregation techniques described above on the task of predicting viral potential. As inputs to each aggregation method we use the 13 different measures listed in Table III. Each measure is used to produce a total ordering of preferences over the 9625 candidates (Twitter users), where ties are broken randomly. We compared 8 rank aggregation methods (see Table VII) to the logistic regression-based approach discussed above.

Note that the results using logistic regression are based on 10-fold cross-validation. This means that the ground truth on 90% (≈9000) of the Twitter accounts is used for training. In practice, we may expect to have the ground truth labeling only for a small validation/training set. So here we experiment with smaller training sets, comparing performance with increasing amounts of labeled data. We average performance, measured by AUC, over 20 runs of random stratified train-test splits for different percentages of data used for training. Our results are presented in Table VII and the most relevant comparisons can be seen in FIG. 4. As baselines, we also compared with all 13 individual measures, but in the interest of brevity we only list the best individual measure (Past Retweets) in our results.

As expected, the supervised versions of each rank aggregation method performed better than the unsupervised versions. This distinction can be seen clearly in FIG. 7, which compares the results of two pairs of aggregation approaches. In FIG. 7, the performance of the Borda method and the supervised Borda method are shown at 71 and 72, respectively. The results of the Local Kemenization are shown at 73, and the results of the supervised Local Kemenization are shown at 74. We also observe that all aggregation techniques improve over the best individual rank measures. The exception here is Local Kemenization on total orderings, which can often perform worse than Borda and Past Retweets. This is counter to what one might expect from the work of Dwork et al. [C. Dwork, R. Kumar, R. Naor, and D. Sivakumar, “Rank aggregation methods for the web,” in WWW, 2001]. However, the real benefit to using Local Kemenization can be seen when it is applied only to the partial ordering of the top k candidates of each component ranking. When applied to partial orderings, LK TopK performs better than Borda. These results are improved by using the supervised weighted version, Supervised LK TopK, and are further improved by using Supervised Borda as the initial ranking.

In FIG. 8 we see that, while logistic regression performs well with ground truth on a large number of candidates, its performance drops significantly with lower levels of supervision. In contrast, our supervised rank aggregation methods are fairly stable, consistently beating the best individual ranking and performing better than logistic regression in the more realistic setting of moderately-sized training sets. In FIG. 8, the results of the Past Retweets and of the Local Kemenization are shown at 81 and 82, respectively. The results of the LK TopK method are shown at 83, and the results of the Supervised LK TopK method are shown at 84. Notably, the best performing approach is Supervised LK TopK (Supervised Borda), which confirms the advantages of supervised locally optimal order-based ranking compared to score-based aggregation, as in Borda, and unsupervised methods.

TABLE VII
RANK AGGREGATION PERFORMANCE MEASURED IN AUC FOR VARIOUS TRAINING SET SIZES

                                        Training Sample Size (%)
Ranking Method                          3%      4%      5%      8%      10%
Borda                                   0.9085  0.9095  0.9087  0.9079  0.9087
Supervised Borda                        0.9121  0.9134  0.9118  0.9101  0.9113
LK (Borda)                              0.8872  0.8877  0.8843  0.8878  0.8848
Supervised LK (Borda)                   0.8872  0.8897  0.8862  0.8878  0.8857
LK TopK (Borda)                         0.9131  0.9134  0.914   0.9126  0.9143
Supervised LK TopK (Borda)              0.9137  0.9139  0.9149  0.9126  0.9144
LK TopK (Supervised Borda)              0.9131  0.9134  0.914   0.9126  0.9143
Supervised LK TopK (Supervised Borda)   0.9174  0.9177  0.9168  0.9151  0.9172
Logistic Regression                     0.8435  0.8744  0.9055  0.9149  0.9311
Past Retweets                           0.8935  0.8947  0.8954  0.8944  0.8991

Who is Going Viral?

In the discussion below, we provide a qualitative analysis of our data, taking a closer look at a number of twitterers who are most often retweeted. What is most interesting is the diversity in the behavior on Twitter. Table VIII shows the top 8 users with more than 500 retweets in the week-long test period. Here we observe anecdotally that no single measure is a good indicator of viral potential. Comparing to previous time periods, only the top 2 users have consistently high retweet rates. However, some others, e.g., go from 3 or 7 retweets in one week to more than 500 in the next. Looking at the actual tweets, we identify at least 4 archetypes in our dataset:

The Web-aggregator is very active on Twitter and the Internet in general. He or she posts many tweets, the majority of which have shortened URLs to other web content. He is typically not retweeting too much himself, but is following a fair number of users. Examples of this category are flipbooks, chrisbrogan, buzzedition, and lizzarddawg. The topics they mention can be diverse: some users in this group promote political views, some of which are rather controversial.

TABLE VIII
STATISTICS ON THE 8 MOST REBROADCASTED USERS.

User            Followers  Friends  Past      Retweets  Mentions  Mentions  Tweets  Retweets  Distinct Past
                                    Retweets  Made      Received  Made      (test)  (test)    Retweets
Flipbooks       25539      8129     940       27        534       83        1604    2904      1029
chrisbrogan     215626     85485    1965      26        2383      72        585     1110      971
trinarockstarr  230418     189      7         0         0         2         138     1017      742
cardosoa        33104      119      107       26        0         60        625     990       531
lizarddawg      17187      8975     380               2548        83        1920    967       409
buzzedition     92535      46169    109       42        0         173       733     831       698
therealtarji    41541      27       3         1         0         1         63      727       635
littleanglemee  703        298      65        22                6914        1800    694       53

The provocateur posts strong personal statements and sometimes links to provocative external content. She is apparently an aspiring actress with a notable fan base reading her tweets. This type of effective viral creator posts a limited number of tweets, has many followers but is following only a moderate number of other users and has many more retweets than tweets. The other user in this class is cardoso.

The self/event promoter is similar to the previous category, but is using Twitter as one of many platforms to specifically promote her/himself and special events. Depending on the degree of success we see a much larger list of followers over the number of users being followed. An example is trinarockstarr. Note that in the past there was very little activity in her account. The activity in the prediction period is related to the Orlando Classic Concert where she was performing.

The conversationalist is using Twitter very extensively, as a substitute for instant messaging, to have a group chat with friends, e.g. littleangelmee. For such users, we see a rather small number of followers and a small list of users being followed. The most notable characteristic of this category is that the number of unique users retweeting a conversationalist is very small.

The method of the present invention will be generally implemented by a computer executing a sequence of program instructions for carrying out the steps of the method and may be embodied in a computer program product comprising media storing the program instructions. Referring to FIG. 9, a computer system 100 is depicted on which the method of the present invention may be carried out. Processing unit 102 houses a processor, memory and other systems components that implement a general purpose processing system that may execute a computer program product comprising media, for example a floppy disc that may be read by processing unit 102 through floppy drive 104.

The program product may also be stored on hard disk drives within processing unit 102 or may be located on a remote system 106 such as a server, coupled to processing unit 102 via a network interface, such as an Ethernet interface. Monitor 110, mouse 112 and keyboard 114 are coupled to processing unit 102 to provide user interaction. Scanner 116 and printer 120 are provided for document input and output. Printer 120 is shown coupled to processing unit 102 via a network connection, but may be coupled directly to the processing unit. Scanner 116 is shown coupled to processing unit 102 directly, but it should be understood that peripherals may be network coupled or direct coupled without affecting the ability of computer system 100 to perform the method of the invention.

While it is apparent that the invention herein disclosed is well calculated to achieve the features discussed above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

The invention claimed is:
 1. A method of predicting users who will have a specified influence in a social network, comprising: identifying a set of users of the social network; identifying a subset of the users as influential users based on defined criteria; identifying a multitude of user behavior measures as predictors of which ones of the set of users are said influential users; forming a composite predictor model from said user predictor measures to determine which ones of the set of users will have a specified influence in the social network in the future, including using each of said measures to provide a respective one ranking of the influence in the social network of an associated group of users of the set of users; aggregating said respective one rankings to form an aggregate ranking of said associated group of the users; and using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future.
 2. The method according to claim 1, wherein: in said social network, messages are sent by and among the set of users; and the using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future includes using said aggregate ranking to predict which ones of the set of users will have the most number of the messages sent from said ones of the users re-sent by others of the users.
 3. The method according to claim 1, wherein the using each of said measures to provide a respective one ranking of an associated group of the users includes: obtaining a training set of data from the set of users, including determining, for each of the users, how many messages sent by said each user in a specified time period that were re-sent by other users; and using said training set of data to determine, for each of the measures, an effectiveness of said each measure as predicting which ones of the set of users sent the messages that were most re-sent by the other users.
 4. The method according to claim 1, wherein the forming the composite predictor model includes using all of said multitude of measures to form the composite predictor model.
 5. The method according to claim 1, wherein the forming the composite predictor model includes using only some of said multitude of measures to form the composite predictor model.
 6. The method according to claim 1, wherein the forming the composite predictor model includes combining all of said measures through logistic regression to form the composite predictor model.
 7. The method according to claim 1, wherein: the identifying a subset of the users includes separating the set of users into a multitude of classes according to degrees of influence of the users based on said defined criteria; and the using said composite predictor model includes using the composite predictor model to predict how often users in the different classes will have said specified influence in the social network in the future.
 8. The method according to claim 1, wherein: the identifying a subset of the users includes separating the set of users into a multitude of classes according to a degree of influence of each of the users based on said defined criteria; and the aggregating is based on an accuracy of each of the measures as a predictor of which of the users are in which of the classes.
 9. The method according to claim 1, wherein: the aggregating includes assigning different weights to the multitude of rankings based on defined criteria to form a supervised rank aggregation of said measures; and the forming the composite predictor model includes using said supervised rank aggregation of said measures to form the composite predictor model.
 10. The method according to claim 1, wherein the identifying a subset of the users as influential includes: identifying a class of users whose messages have been re-sent more than a given number of times in a given time period; using logistic regression to train a model on the user behavior measures of said identified class of users on historical data, and then using the model to predict a potential for re-sending of messages in a test time period.
 11. A system for predicting users who will have a specified influence in a social network, said social network includes an identified set of users and a subset of the users identified as influential users based on defined criteria, the system comprising one or more processing units configured for: forming a composite predictor model from a multitude of user behavior measures identified as predictors of which ones of the set of users are the influential users to determine which ones of the set of users will have a specified influence in the social network in the future, including using each of said measures to provide a respective one ranking of the influence in the social network of an associated group of users of the set of users; aggregating said respective one rankings to form an aggregate ranking of said associated group of users; and using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future.
 12. The system according to claim 11, wherein: in said social network, messages are sent by and among the set of users; and the using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future includes using said aggregate ranking to predict which ones of the set of users will have the most number of the messages sent by said one of the users that are resent by other users.
 13. The system according to claim 11, wherein, in said social network, messages are sent by and among the set of users, and wherein the comparing said measures includes: obtaining a training set of data from the set of users, including determining, for each user in the set of users, how many messages sent by said each user in a specified time period were re-sent by other users; and using said training set of data to determine, for each of the measures, an effectiveness of said each measure as predicting which ones of the set of users sent the messages that were most re-sent by the other users; and aggregating said measures based on said determined effectiveness of said each of the measures.
 14. The system according to claim 11, wherein the forming the composite predictor model includes combining all of said measures through logistic regression to form the composite predictor model.
 15. The system according to claim 11, wherein: the subset of the users is identified by separating the set of users into a multitude of classes according to degrees of influence of said users based on said defined criteria; and the using said composite predictor model includes using the composite predictor model to predict how often users in the different classes will have said specified influence in the social network in the future.
 16. An article of manufacture comprising: at least one tangible computer readable hardware medium having computer readable program code logic tangibly embodied therein to predict users who will have a specified influence in a social network, said social network includes an identified set of users and a subset of the set of users identified as influential users based on defined criteria, said computer readable program code logic, when executing, performing the following: forming a composite predictor model from a multitude of user behavior measures identified as predictors of which ones of the set of users are the influential users to determine which ones of the set of users will have a specified influence in the social network in the future, including using each of said measures to provide a respective one ranking of the influence in the social network of an associated group of users of the set of users; aggregating said respective one rankings to form an aggregate ranking of said associated group of users; and using said aggregate ranking to predict which ones of the set of users will have a specified effect in the social network in the future.
 17. The article of manufacture according to claim 16, wherein: in said social network, messages are sent by and among the set of users; and the using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future includes using said aggregate rankings to predict which ones of the set of users will have the most the number of the messages sent by said ones of the users that are re-sent by other users.
 18. The article of manufacture according to claim 16, wherein, in said social network, messages are sent by and among the set of users, and wherein the aggregating said measures includes: obtaining a training set of data from the set of users, including determining, for each of the users in the set of users, how many messages sent by said each user in a specified time period were re-sent by other users; and using said training set of data to determine, for each of the measures, an effectiveness of said each measure as predicting which ones of the set of users sent the messages that were most re-sent by the other users; and aggregating said measures based on said determined effectiveness of said each of the measures.
 19. The article of manufacture according to claim 16, wherein: the identifying a subset of the users includes separating the set of users into a multitude of classes according to degrees of influence of the users based on said defined criteria; and the using said composite predictor model includes using the composite predictor model to predict how often users in the different classes will have said specified influence in the social network in the future.
 20. The article of manufacture according to claim 16, wherein the forming the composite predictor model includes: assigning different weights to the multitude of measures based on defined criteria to form a supervised rank aggregation of said measures; and using said supervised rank aggregation of said measures to form the composite predictor model.
 21. A method of predicting users who will have a specified influence in a social network, wherein said social network includes an identified set of users and a subset of the users identified as influential users based on defined criteria, in said social network, messages are sent by and among the set of users, and said influential users are identified based on the messages sent by the users, the method comprising: identifying a multitude of user behavior measures as predictors of which ones of the set of users are the influential users; forming a composite predictor model from said user predictor measures to determine which ones of the set of users will have a specified influence in the social network in the future, including using each of said measures to provide a respective one ranking of the influence in the social network of an associated group of users of the set of users; aggregating said respective one rankings to form an aggregate ranking of said associated group of the users; and using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future.
 22. The method according to claim 21, wherein: in said social network, messages are sent by and are among the set of users; and the using said aggregate ranking to predict which ones of the set of users will have a specified influence in the social network in the future includes using said aggregate ranking to predict which ones of the set of users will have the most number of the messages sent from said ones of the users that are re-sent by other users.
 23. The method according to claim 21, wherein the aggregating said measures includes: obtaining a training set of data from the set of users, including determining, for each user in the set of users, how many messages sent by said each user in a specified time period were re-sent by other users; and using said training set of data to determine, for each of the measures, an effectiveness of said each measure in predicting which ones of the set of users sent the messages that were most re-sent by the other users; and aggregating said measures based on said determined effectiveness of said measures.
 24. The method according to claim 21, wherein the forming the composite predictor model includes combining all of said measures through logistic regression to form the composite predictor model.
 25. The method according to claim 21, wherein the forming the composite predictor model includes: assigning different weights to the multitude of measures based on defined criteria to form a supervised rank aggregation of said measures, including weighting each of said measures based on a normalized correlation computed on a training set of data; and using said supervised rank aggregation of said measures to form the composite predictor model.